initialization strategy
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Vision (0.98)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)
- Europe > Italy > Friuli Venezia Giulia > Trieste Province > Trieste (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.68)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Data Science > Data Mining (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language (0.93)
- North America > United States (0.14)
- Asia > China > Hong Kong (0.04)
- Asia > Middle East > Republic of Türkiye > Batman Province > Batman (0.04)
Subquadratic Overparameterization for Shallow Neural Networks
Overparameterization refers to the important phenomenon where the width of a neural network is chosen such that learning algorithms can provably attain zero loss in nonconvex training. The existing theory establishes such global convergence using various initialization strategies, training modifications, and width scalings. In particular, the state-of-the-art results require the width to scale quadratically with the number of training data under standard initialization strategies used in practice for best generalization performance. In contrast, the most recent results obtain linear scaling either by requiring initializations that lead to lazy training, or by training only a single layer. In this work, we provide an analytical framework that allows us to adopt standard initialization strategies, possibly avoid lazy training, and train all layers simultaneously in basic shallow neural networks while attaining a desirable subquadratic scaling in the network width. We achieve these desiderata via the Polyak-Lojasiewicz condition, smoothness, and standard assumptions on the data, using tools from random matrix theory.
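For reference, the Polyak-Lojasiewicz (PL) condition invoked in this abstract is commonly stated as below; the paper's exact constants and norm conventions may differ.

```latex
% Polyak-Lojasiewicz (PL) condition for a training loss L(\theta) with infimum L^*:
% there exists \mu > 0 such that, for every parameter vector \theta,
\[
  \tfrac{1}{2}\,\lVert \nabla L(\theta) \rVert_2^{2}
  \;\ge\; \mu \,\bigl( L(\theta) - L^{*} \bigr).
\]
% Together with smoothness of L, this inequality yields linear convergence of
% gradient descent to a global minimizer even when L is nonconvex, which is the
% mechanism the abstract relies on.
```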
A comparison between initialization strategies for the infinite hidden Markov model
Cortese, Federico P., Rossini, Luca
Infinite hidden Markov models provide a flexible framework for modelling time series with structural changes and complex dynamics, without requiring the number of latent states to be specified in advance. This flexibility is achieved through the hierarchical Dirichlet process prior, while efficient Bayesian inference is enabled by the beam sampler, which combines dynamic programming with slice sampling to truncate the infinite state space adaptively. Despite extensive methodological developments, the role of initialization in this framework has received limited attention. This study addresses this gap by systematically evaluating initialization strategies commonly used for finite hidden Markov models and assessing their suitability in the infinite setting. Results from both simulated and real datasets show that distance-based clustering initializations consistently outperform model-based and uniform alternatives, the latter being the most widely adopted in the existing literature.
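As background for the comparison above, a typical distance-based clustering initialization assigns observations to states with k-means and derives starting parameters from that partition; the sketch below is a minimal illustration under that assumption, not the authors' exact procedure or its integration with the beam sampler.

```python
# Minimal sketch: k-means ("distance-based") initialization of HMM parameters.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_hmm_init(y, n_states, seed=0):
    """Cluster a univariate series, then derive initial HMM parameters."""
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    labels = KMeans(n_clusters=n_states, random_state=seed, n_init=10).fit_predict(y)

    # State-specific Gaussian emission parameters (for illustration).
    means = np.array([y[labels == k].mean() for k in range(n_states)])
    stds = np.array([y[labels == k].std() + 1e-6 for k in range(n_states)])

    # Empirical transition matrix from consecutive cluster labels (add-one smoothing).
    trans = np.ones((n_states, n_states))
    for a, b in zip(labels[:-1], labels[1:]):
        trans[a, b] += 1
    trans /= trans.sum(axis=1, keepdims=True)

    return labels, means, stds, trans

# Example: labels, means, stds, trans = kmeans_hmm_init(np.random.randn(500), n_states=3)
```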
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Romania (0.04)
- Energy (0.68)
- Banking & Finance > Economy (0.68)
Improving Continual Learning of Knowledge Graph Embeddings via Informed Initialization
Pons, Gerard, Bilalli, Besim, Queralt, Anna
Many Knowledge Graphs (KGs) are frequently updated, forcing their Knowledge Graph Embeddings (KGEs) to adapt to these changes. To address this problem, continual learning techniques for KGEs incorporate embeddings for new entities while updating the old ones. One necessary step in these methods is the initialization of the embeddings, as an input to the KGE learning process, which can have an important impact on the accuracy of the final embeddings, as well as on the time required to train them. This is especially relevant for relatively small and frequent updates. We propose a novel informed embedding initialization strategy, which can be seamlessly integrated into existing continual learning methods for KGEs, that enhances the acquisition of new knowledge while reducing catastrophic forgetting. Specifically, the KG schema and the previously learned embeddings are utilized to obtain initial representations for the new entities, based on the classes the entities belong to. Our extensive experimental analysis shows that the proposed initialization strategy improves the predictive performance of the resulting KGEs while also enhancing knowledge retention. Furthermore, our approach accelerates knowledge acquisition, reducing the number of epochs, and therefore the time, required to incrementally learn new embeddings. Finally, its benefits across various types of KGE learning models are demonstrated.
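The class-based initialization described in this abstract can be pictured as follows; this is a hedged sketch in which the per-class mean aggregation, the fallback rule, and all names are illustrative assumptions rather than the authors' implementation.

```python
# Sketch: initialize new KG entity embeddings from the mean embedding of
# previously trained entities sharing the same schema class (global mean as
# fallback for unseen classes). The aggregation choice is an assumption.
import numpy as np

def informed_init(old_embeddings, entity_class, new_entities, dim):
    """old_embeddings: {entity: np.ndarray(dim,)}; entity_class: {entity: class}."""
    global_mean = np.mean(list(old_embeddings.values()), axis=0)

    # Per-class means over the previously learned embeddings.
    sums, counts = {}, {}
    for ent, emb in old_embeddings.items():
        cls = entity_class.get(ent)
        sums[cls] = sums.get(cls, np.zeros(dim)) + emb
        counts[cls] = counts.get(cls, 0) + 1
    class_means = {c: sums[c] / counts[c] for c in sums}

    # New entities start from their class mean, or the global mean if the class is new.
    return {
        ent: class_means.get(entity_class.get(ent), global_mean).copy()
        for ent in new_entities
    }
```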
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > Spain (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.66)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.63)
On Multilingual Encoder Language Model Compression for Low-Resource Languages
Gurgurov, Daniil, Gregor, Michal, van Genabith, Josef, Ostermann, Simon
In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming for extreme compression of multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% while maintaining competitive performance, with average drops of 2-10% for moderate compression and 8-13% at maximum compression across four downstream tasks (sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging) and three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model: larger datasets result in smaller performance losses. Additionally, we conduct ablation studies to identify the best practices for multilingual model compression using these techniques.
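Of the techniques combined above, vocabulary trimming is the most mechanical; the sketch below shows the core idea under simple assumptions (an embedding matrix and a precomputed set of token ids observed in the target-language corpus), omitting the tokenizer surgery, weight tying, and MLM-head resizing a real pipeline needs.

```python
# Sketch: trim a multilingual embedding matrix to the token ids actually used
# by one low-resource language, and build the old-id -> new-id mapping.
import torch

def trim_embedding(embedding: torch.Tensor, kept_ids):
    """embedding: (vocab, hidden); kept_ids: iterable of token ids to keep."""
    kept_ids = sorted(set(kept_ids))
    new_embedding = embedding[kept_ids].clone()      # (len(kept_ids), hidden)
    old_to_new = {old: new for new, old in enumerate(kept_ids)}
    return new_embedding, old_to_new

# Toy example: reduce a 250k-row multilingual vocabulary to 30k kept ids.
emb = torch.randn(250_000, 768)
kept = range(30_000)                                 # placeholder id set
small_emb, mapping = trim_embedding(emb, kept)
print(small_emb.shape)                               # torch.Size([30000, 768])
```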
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)